Recapping

Session 3 covered:

  • R markdown

  • Writing functions

  • Apply (again)

  • Installing packages

  • Using packages

  • Real life example: DESeq2

Recapping as code

# Functions
multiply = function(x, y) return (x * y)

# Default arguments
multiply = function(x, y=2) return (x * y)

# Scopes
multiply = function(x, y) z <<- x * y

# Passing through arguments
apply(m, 1, multiply, y=2)

# Packages from CRAN
install.packages("ggplot2")

# Packages from Bioconductor
library(BiocManager)
install("DESeq2")
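To see the recapped ideas in action, here is a minimal sketch that calls the functions above. The matrix m and the variant name multiply_global are illustrative assumptions, not objects from Session 3.

```r
# Default argument: y falls back to 2 when it is not supplied
multiply = function(x, y=2) return (x * y)
default_result = multiply(3)         # uses y=2
explicit_result = multiply(3, y=5)   # overrides the default

# Scope: <<- assigns z outside the function body
# (multiply_global is a hypothetical name used to keep both versions available)
multiply_global = function(x, y) z <<- x * y
multiply_global(2, 4)

# Passing arguments through apply(): y=2 is forwarded to multiply()
# m is a small illustrative matrix; any numeric matrix would do
m = matrix(1:6, nrow=2)
row_results = apply(m, 1, multiply, y=2)
```

Note that when the applied function returns a vector, apply() over rows stores each row's result as a column, so row_results has 3 rows and 2 columns here.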

Homework

What did you learn?

Do we need to recap any parts?

Learning objectives

Session 4

  • What makes a good figure?

  • Using colour

  • viridis colour library

  • Plotting with ggplot2

Plotting

Humans are visual creatures!

Visualisation of large data sets is an essential task in molecular biology and medicine. When done effectively, images can help you to explain the most complex of data.

Without an image or summarisation, how well would you do finding significant genes passing a given fold change threshold in a table like this?

##                    baseMean log2FoldChange      lfcSE       pvalue         padj
## PENG0000000001    1.1941808    -0.11077611 0.46558089 1.822729e-01 1.000000e+00
## PENG0000000002    0.1394510    -0.01310322 0.42899817 8.004145e-01 1.000000e+00
## PENG0000000004    0.7393672    -0.01877427 0.39858386 7.883606e-01 1.000000e+00
## PENG0000000006   44.3468742     0.08682659 0.19961999 4.041232e-01 6.352165e-01
## PENG0000000007  788.9630161     1.65624481 0.20412453 2.062683e-18 1.994904e-16
## PENG0000000009 1345.8691348    -0.15601837 0.13576046 9.532748e-02 2.327564e-01
## PENG0000000010    4.4468600     0.04479898 0.29043812 6.594170e-01 1.000000e+00
## PENG0000000012  671.1614707    -0.14329193 0.10518787 7.279070e-02 1.885944e-01
## PENG0000000014    0.1394510    -0.01310322 0.42899817 8.004145e-01 1.000000e+00
## PENG0000000015  770.7712490     0.70623622 0.15143560 1.487255e-07 2.841074e-06
## PENG0000000016    5.3302466     0.14048684 0.33883106 2.315969e-01 1.000000e+00
## PENG0000000022 1276.9182728    -0.06159326 0.07899988 3.184644e-01 5.180787e-01
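As a point of comparison, summarising in code is one way to answer the question. The sketch below copies four rows from the table above; the thresholds (an absolute log2 fold change above 1 and an adjusted p-value below 0.05) are assumptions chosen for illustration.

```r
# Four rows copied from the table above
res = data.frame(
  row.names      = c("PENG0000000007", "PENG0000000009",
                     "PENG0000000012", "PENG0000000015"),
  log2FoldChange = c(1.65624481, -0.15601837, -0.14329193, 0.70623622),
  padj           = c(1.994904e-16, 2.327564e-01, 1.885944e-01, 2.841074e-06)
)

# Illustrative thresholds (assumptions, not values set by the course)
lfc_threshold = 1
padj_threshold = 0.05

# Keep genes that pass both the fold change and the significance threshold
passes = abs(res$log2FoldChange) > lfc_threshold & res$padj < padj_threshold
significant = res[passes, ]
rownames(significant)
```

Of these four rows, only PENG0000000007 passes both thresholds; PENG0000000015 is significant but its fold change is too small.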

What makes a good figure?

There are a few key points you need to consider when deciding upon a method of visualisation:

  • relevance: the message a figure needs to convey

  • salience: how easily the eye can distinguish your message from the background

  • accuracy: how exactly different visualisation methods may convey your message

Especially when presenting (when the audience is listening and reading), it’s vital that salience and relevance are aligned.

Salience

Human vision is highly selective. We understand visual information by selecting, in turn, individual objects or aspects for detailed analysis rather than by appreciating an entire scene.

Formally, salience is the property of an object that sets it apart from its surroundings; it’s a relative property, therefore, and depends on the collection of objects being visualised.

We can enhance salience by manipulating colour, shape, size, and position to focus attention.


It’s not inherently difficult to make information salient …

… what’s important is that the salient points of a figure align with what’s relevant.

Nevertheless, it’s also easy to reduce salience by:

  • displaying too much information together

  • attempting to convey different points of relevance within the same image

  • referring to points of relevance that are tangential to what’s salient

Accuracy

When displaying a graphic, we want the viewer to be able to perceive the patterns and trends that convey the relevant point.

Humans interpret certain visual cues better than others. That said, interpretation is subjective and everyone’s different! Let’s try to rank the following methods of conveying the same information:


Research in the field of visual theory has shown that, in order, people are best able to understand:

  1. positions on a common aligned scale

  2. positions on unaligned scales that are otherwise common

  3. lengths

  4. angles and slopes

  5. area

  6. volume and colour saturation

  7. colour hue
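The gap between the top and the middle of this ranking can be sketched by drawing the same values twice: once as positions on a common scale and once as angles. The toy data frame below is invented purely for illustration, and the sketch assumes ggplot2 (covered later in this session) is installed.

```r
library(ggplot2)

# Toy values, invented purely for illustration
toy = data.frame(group=c("A", "B", "C", "D"), value=c(23, 25, 27, 25))

# Positions on a common aligned scale: small differences are easy to read
bars = ggplot(toy, aes(x=group, y=value)) +
  geom_col() +
  labs(title="Position on a common scale")

# Angles: the same values drawn as pie slices are harder to compare
pie = ggplot(toy, aes(x="", y=value, fill=group)) +
  geom_col(width=1) +
  coord_polar(theta="y") +
  labs(title="Angle")

# Print bars and pie in turn to compare them
```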

Colour

Unfortunately, although it’s often a go-to method, colour is amongst the least reliable methods of conveying information!

This is worsened by the fact that colour perception is relative. We distinguish colours differently depending on their surroundings and can be easily tricked into seeing the same colour differently or into seeing different colours similarly.


Colour vision deficiency

Not only can we be fooled into perceiving colours contextually, but people also differ in their ability to distinguish colours based on their genetics.

Colour blindness (colour vision deficiency or CVD) is a sliding scale and there are a number of different types. ‘Full’ phenotypes for the three more common forms of CVD are shown here.


Across populations with Northern European ancestry, up to 1/12 males and 1/200 females have some level of red-green CVD. UK-wide, 4.5% of the population have some level of CVD and, even by the time they leave school, approximately 40% are unaware.

When producing figures that make use of colour, if we want them to be salient, it’s important to be inclusive! Thankfully, a number of colour scales have been developed that allow figures to retain salience for those with CVD.

Viridis colour library

The viridis library provides colour maps for use in R that are:

  • colourful, spanning as wide a palette as possible

  • perceptually uniform, such that, across the whole range, nearby values have similar-appearing colours and distant values appear more distinct

  • friendly for those with colour vision deficiency

viridis can be installed from CRAN.

install.packages("viridis")
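Once installed, a quick way to get a feel for the colour maps is to ask for a handful of colours. This is a minimal sketch; the choice of five colours and of the "magma" option is arbitrary.

```r
library(viridis)

# Five evenly spaced colours from the default map, returned as hex strings
pal = viridis(5)
pal

# The same request from one of the alternative maps
pal_magma = viridis(5, option="magma")
pal_magma
```

The resulting character vectors can be passed anywhere R accepts colours, for example the col= argument of base plotting functions.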

Plotting in R

Although base R can produce a variety of plots, getting them to be ‘just so’ can be extremely difficult!

These days, virtually all ‘pretty’ plotting is done with the ggplot2 library.

install.packages("ggplot2")

Given its popularity, lots of libraries interface with ggplot2 (e.g. viridis). To complement ggplot2, we’ll also install patchwork, which helps to lay out plots in rows and grids.

install.packages("patchwork")
library("ggplot2")
library("patchwork")

ggplot

The central ggplot() function provides a consistent interface to map data to the aesthetics of geometries.

In other words:

  • the tabular data that we wish to plot …

  • … has various facets, which we map to the aesthetic properties we can perceive (position, colour, shape, size, and transparency) …

  • … of various geometric methods of displaying the data (bars, points, lines, etc)


Most commonly, we make a plot by passing the ggplot() function two arguments:

  • data, which takes a data frame (or something that can be coerced to one)

  • mapping, which uses the aes() function to define the aesthetic mappings we’d like to use

penguins = na.omit(read.csv("data/1_palmerpenguins.csv"))
penguins$species = factor(penguins$species)
penguins$island = factor(penguins$island)
penguins$sex = factor(penguins$sex)
penguins$year = factor(penguins$year, ordered=TRUE)
g = ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm))
class(g)
## [1] "gg"     "ggplot"

Here, we’ve instantiated a ggplot object with the penguins data frame and defined two simple aesthetics - that the x axis should display bill_length_mm and that y should display bill_depth_mm.


Let’s have a look at what we’ve made.

g

Beautiful, we’re finished!

Remember that a ggplot requires data, aesthetics, and geometries. So far, we haven’t added any geometries to our plot. We can think of a ggplot as a painting; ggplot() provides the canvas and the geometries provide the layers of paint.

Geometries

Geometries (geom_...() functions) added to a ggplot display data according to the aesthetics defined in the ggplot object. Geometries are added (literally, as we use the + operator) to the ggplot and we reassign the result to the original object variable name.

g = g + geom_point()
g

Now we can see something!

Aesthetics

What we have so far is a little simplistic! Let’s re-work the aesthetics by getting the ggplot object to colour the points according to the levels of the species factor (color and col are also valid if you’re not British).

g =
  ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm, colour=species)) +
  geom_point()
g


The aesthetic properties defined in the ggplot() function call are set as the defaults for all geometries, provided they’re applicable. All geometries added will be passed these defaults.

g +
  geom_smooth(method="lm")

Here, the linear model (method="lm") trend lines we add using geom_smooth() inherit their colour aesthetic from the ggplot parent object.

Practical

Let’s make our first ggplot!

  • Set up a ggplot object displaying flipper_length_mm against body_mass_g

  • Add a colour aesthetic to the plot

  • Add a geom_smooth(method="lm") geometry to the plot

  • What happens if we exchange geom_point() for geom_density2d()?

Partitioning with aesthetics

The more factors we map to aesthetics, the more we partition the data.

g = ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm, colour=species, shape=sex))
g +
  geom_point() +
  geom_smooth(method="lm")

Here, the shape=sex aesthetic further subdivided the data by sex, giving 6 trend lines instead of 3.


As this shows, we don’t always want our geometries to inherit all of the defaults passed to the ggplot() function! Thankfully:

  • individual geometries can have alternate aesthetics passed using the mapping= argument, which allows individual aesthetic parameters to be reset to NULL.

    geom_point(mapping=aes(shape=NULL))

  • aesthetics can be manually specified for all data by passing the aesthetic parameter as an argument itself

    geom_point(colour="black")


Here, we pass alpha=0.5 to geom_point() to manually set the alpha (transparency) of the points. Additionally, we override the shape aesthetic within geom_smooth() by setting it to NULL so that we don’t duplicate our trend lines.

g = 
  g +
  geom_point(alpha=0.5) +
  geom_smooth(mapping=aes(shape=NULL), method="lm")
g
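One way to check the effect of the override is to count the groups that geom_smooth() ends up fitting. The sketch below is self-contained, so it uses the mpg data set that ships with ggplot2 as a stand-in for the penguins data (an assumption for illustration), with drv and year playing the roles of species and sex.

```r
library(ggplot2)

# mpg ships with ggplot2; used here as a stand-in for the penguins data
base = ggplot(mpg, aes(x=displ, y=hwy, colour=drv, shape=factor(year)))

# Every layer inherits colour and shape, so both partition the trend lines
inherited = base +
  geom_point() +
  geom_smooth(method="lm")

# Resetting shape for geom_smooth() leaves only colour to partition by
overridden = base +
  geom_point() +
  geom_smooth(mapping=aes(shape=NULL), method="lm")

# layer_data() returns the computed data for a layer; layer 2 is geom_smooth()
n_inherited = length(unique(layer_data(inherited, 2)$group))
n_overridden = length(unique(layer_data(overridden, 2)$group))
```

With mpg, this gives six groups when shape is inherited (three drive types by two years) and three once it is reset to NULL.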


Depending on the geometry, different aesthetic options are more relevant than others. Some aesthetics are only available for specific geometries.

As a general guide, discrete variables are best visually separated with …

  • shape for very small numbers of discrete groups and where overlapping is minimal

  • linetype for very small numbers of discrete groups

  • colour or fill for filled geometries (where colour alters the border) using separate hues

… whereas continuous variables are most accurately displayed using:

  • size or linewidth

  • alpha

  • saturation of a single colour
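As a rough illustration of this guide, the sketch below maps a discrete variable to colour and a continuous variable to size. It uses the iris data set that ships with R as a stand-in for the penguins data (an assumption made so the example is self-contained).

```r
library(ggplot2)

# iris ships with R; used here as a stand-in for the penguins data
# Discrete variable (Species)        -> separate hues via colour
# Continuous variable (Petal.Length) -> point size
p = ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width,
                     colour=Species, size=Petal.Length)) +
  geom_point(alpha=0.5) +
  theme_bw()

# Print p to draw the plot
```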

Themes

Separately to the aesthetics of the geometries, we can control other visual aspects of the plot by modifying the theme() defaults or by using one of the theme_...() presets. We can also use the labs() function to control the axis and plot titles.

g +
  theme_bw() +
  labs(x="Bill length (mm)", y="Bill depth (mm)", title="Graph with theme_bw()")

Practical

Let’s add some style to our ggplot.

  • Update your previous plot to use a theme preset. Try out theme_bw(), theme_classic(), and theme_minimal().

  • Even when using a theme preset, we can still override specific elements using theme(). What does adding theme(axis.text.y=element_text(angle=90, vjust=0.5, hjust=0.5)) achieve?

  • How might we rotate the labels for the x axis?

  • Looking at the help for element_text(), how might we change the size of the plot.title element?

Different geometries

There are many geometries available to help display data of different formats. The majority of graphing applications fall into five groups:

  • single continuous variable: geom_freqpoly(), geom_histogram(), geom_area(), geom_density()

  • single discrete variable: geom_bar()

  • two continuous variables: geom_point(), geom_smooth(), geom_rug(), geom_density_2d()

  • two variables, one continuous and one discrete: geom_boxplot(), geom_violin(), geom_dotplot(), geom_jitter(), geom_col()

  • two discrete variables: geom_count(), geom_jitter()


Let’s see a few options for plotting a continuous against a discrete variable.

Different geometries can be used to highlight - make salient - different aspects of the data. Here, geom_boxplot() better shows the position of the median value, whereas geom_violin() better highlights the spread of the data and geom_jitter() might do that too much!
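A comparison along these lines could be sketched as follows. The iris data set is used as a stand-in for the penguins data (an assumption made so the example is self-contained), and the jitter width is an arbitrary choice.

```r
library(ggplot2)
library(patchwork)

# iris ships with R; used here as a stand-in for the penguins data
base = ggplot(iris, aes(x=Species, y=Sepal.Length, colour=Species)) +
  theme_bw()

p_box = base + geom_boxplot() + labs(title="geom_boxplot()")
p_violin = base + geom_violin() + labs(title="geom_violin()")
p_jitter = base + geom_jitter(width=0.2) + labs(title="geom_jitter()")

# patchwork's + operator lays the plots out side by side;
# guides="collect" merges the three identical legends into one
comparison = p_box + p_violin + p_jitter + plot_layout(guides="collect")

# Print comparison to draw the combined figure
```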


Let’s see a few options for plotting two continuous variables against each other.

Here, as an alternative to geom_point(), geom_density2d() may better highlight the ‘centre of mass’ for each group. geom_rug() might be suitable alongside another geometry but otherwise lacks both accuracy and saliency.
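These three geometries could be compared with a sketch like the one below, again using iris as a stand-in for the penguins data (an assumption for illustration). geom_density_2d() relies on the MASS package, which is normally installed alongside R.

```r
library(ggplot2)
library(patchwork)

# iris ships with R; used here as a stand-in for the penguins data
base = ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, colour=Species)) +
  theme_bw()

p_point = base + geom_point() + labs(title="geom_point()")
p_density = base + geom_density_2d() + labs(title="geom_density_2d()")
p_rug = base + geom_rug() + labs(title="geom_rug()")

# Lay the three plots out side by side with a single shared legend
comparison = p_point + p_density + p_rug + plot_layout(guides="collect")

# Print comparison to draw the combined figure
```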


Practical

Using this base …

ggplot(penguins, aes(x=body_mass_g, colour=species))

… let’s compare a couple of options for plotting a single continuous variable.

  • Add a geom_histogram() to the plot

  • Switch that for a geom_freqpoly()

  • Which plot has better accuracy and saliency?

Using colour properly

Let’s apply some of our knowledge about using colour effectively … and inclusively!

The viridis colour maps integrate easily with ggplot2, which provides various scale_colour_viridis_...() functions:

  • scale_colour_viridis_d() makes salient mappings of discrete variables

  • scale_colour_viridis_c() makes accurately distinguishable mappings for continuous variables

  • scale_colour_viridis_b() merges the two - continuous variables are binned to enhance salience


Let’s start by replacing ggplot2’s default colour scheme for discrete variables using the scale_colour_viridis_d() (d for discrete) function.

ggplot(penguins, aes(x=body_mass_g, y=flipper_length_mm, colour=species)) +
  geom_point() +
  theme_bw() +
  scale_colour_viridis_d()

Pretty easy!
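The continuous and binned variants work the same way once a continuous variable is mapped to colour. A minimal sketch, using iris as a stand-in for the penguins data (an assumption for illustration):

```r
library(ggplot2)

# iris ships with R; used here as a stand-in for the penguins data
# Petal.Length is continuous, so it suits the _c and _b scales
base = ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, colour=Petal.Length)) +
  geom_point() +
  theme_bw()

# Smooth gradient across the whole range
p_continuous = base + scale_colour_viridis_c()

# The same variable, binned into discrete colour steps
p_binned = base + scale_colour_viridis_b()

# Print p_continuous and p_binned in turn to compare them
```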


As we saw earlier, viridis has a variety of palettes. Let’s compare how good they are at visually discriminating discrete data by passing the option= argument to the scale_colour_viridis_d() function.
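One possible way to set up this comparison is to build the same plot once per palette and arrange the results with patchwork. The sketch uses iris as a stand-in for the penguins data, and the five palette names listed are a selection rather than the full set (both are assumptions for illustration).

```r
library(ggplot2)
library(patchwork)

# iris ships with R; used here as a stand-in for the penguins data
base = ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, colour=Species)) +
  geom_point() +
  theme_bw()

# A selection of the palette names accepted by the option= argument
palettes = c("viridis", "magma", "inferno", "plasma", "cividis")

# Build one plot per palette, titled with the palette name
plots = lapply(palettes, function(p) {
  base + scale_colour_viridis_d(option=p) + labs(title=p)
})

# wrap_plots() arranges a list of plots in a grid
comparison = wrap_plots(plots, ncol=3)

# Print comparison to draw the grid
```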

Scales

Stats

Facets

Saving

Homework

There are more penguin-related homework tasks to help cement what we’ve covered today!

The homework and instructions can be found within the main directory for the course: ./homework/Homework_4.Rmd